ICR OPINION WG2

Johannes B. Gruber

2023-09-28

Introduction

After the first round of annotations (after the pilot), let’s see how we are doing in terms of intercoder reliability!

  1. Look at intercoder agreement
  2. Check the preliminary results

Top Coders!

Measurement Problems

  • Branching in the codebook makes calculating intercoder reliability tough
  • What does it mean when one coder answers that an abstract uses a tool while another has clicked that the research does NOT measure opinion (i.e., was never asked the tool question)?
  • options:
    1. treat IRRELEVANT as valid code?
    2. treat as missing and ignore (works only for Krippendorff’s α)?
    3. calculate agreement on full branch path?

unit_id  83                86                91                99                90                100
4775     IRRELEVANT        IRRELEVANT        IRRELEVANT        IRRELEVANT        IRRELEVANT        IRRELEVANT
4567     Yes (explicitly)  Yes (explicitly)  Yes (explicitly)  IRRELEVANT        Yes (explicitly)  Yes (explicitly)
1948     IRRELEVANT        Yes (explicitly)  IRRELEVANT        IRRELEVANT        IRRELEVANT        IRRELEVANT
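
To make options 1 and 2 concrete, here is a minimal sketch of nominal Krippendorff's α computed on the three example units above, once with IRRELEVANT counted as a regular code and once with it treated as missing. The function name `krippendorff_alpha_nominal` and its `missing` parameter are illustrative assumptions, not the code behind the numbers in this deck, and option 3 (full branch path) is not covered.

```python
import math
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units, missing=None):
    """Nominal Krippendorff's alpha; codes equal to `missing` are dropped.

    `units` is a list of lists, one inner list of codes per coded unit.
    Returns NaN when alpha is undefined (no expected disagreement).
    """
    coincidences = Counter()  # weighted counts of ordered code pairs
    for codes in units:
        values = [c for c in codes if c != missing]
        m = len(values)
        if m < 2:  # a unit with fewer than two codes cannot be paired
            continue
        for c, k in permutations(values, 2):
            coincidences[(c, k)] += 1 / (m - 1)

    totals = Counter()  # marginal weight per code
    for (c, _k), weight in coincidences.items():
        totals[c] += weight
    n = sum(totals.values())
    if n <= 1:
        return math.nan

    observed = sum(w for (c, k), w in coincidences.items() if c != k) / n
    expected = sum(
        totals[c] * totals[k] for c in totals for k in totals if c != k
    ) / (n * (n - 1))
    if expected == 0:  # only one code left: alpha is undefined
        return math.nan
    return 1 - observed / expected


YES, IRR = "Yes (explicitly)", "IRRELEVANT"
example_units = [
    [IRR, IRR, IRR, IRR, IRR, IRR],  # unit 4775
    [YES, YES, YES, IRR, YES, YES],  # unit 4567
    [IRR, YES, IRR, IRR, IRR, IRR],  # unit 1948
]

alpha_valid = krippendorff_alpha_nominal(example_units)  # option 1
alpha_missing = krippendorff_alpha_nominal(example_units, missing=IRR)  # option 2
print(round(alpha_valid, 3), alpha_missing)
```

On these three units, option 1 gives α ≈ 0.53, while option 2 leaves a single unit with a single code and α becomes undefined (NaN). The toy example is far too small to say which option is preferable; it only shows how strongly the choice can move the result.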

ICR

Is this research that measures a human opinion?

Agreement (all coders)
name value
n_Units 50
n_Coders 17
n_Categories 2
Krippendorffs_Alpha 0.7
Agreement (full cases)
name value
n_Units 50
n_Coders 6
n_Categories 2
Agreement 0.74
Holstis_CR 0.89
Krippendorffs_Alpha 0.73
Fleiss_Kappa 0.73
Brennan_Predigers_Kappa 0.73
Lotus 0.93
S_Lotus 0.85
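
The gap between Agreement and the other raw measures is easier to read with the definitions side by side. The sketch below uses common textbook readings of the four uncorrected or lightly corrected measures: simple agreement (all coders give the same code), Holsti's CR (mean pairwise agreement), Lotus (mean share of coders choosing the modal code) and S-Lotus (Lotus rescaled by 1/number of categories). These readings and all function names are assumptions; the package that produced the tables may differ in detail. The data are the three example units from the Measurement Problems slide.

```python
from collections import Counter
from itertools import combinations


def simple_agreement(units):
    """Share of units on which every coder gave the same code."""
    return sum(len(set(codes)) == 1 for codes in units) / len(units)


def holsti_cr(units):
    """Mean agreement over all coder pairs (assumes complete data)."""
    n_coders = len(units[0])
    per_pair = [
        sum(codes[a] == codes[b] for codes in units) / len(units)
        for a, b in combinations(range(n_coders), 2)
    ]
    return sum(per_pair) / len(per_pair)


def lotus(units):
    """Mean share of coders who chose the most frequent code of a unit."""
    shares = [Counter(codes).most_common(1)[0][1] / len(codes) for codes in units]
    return sum(shares) / len(shares)


def s_lotus(units, n_categories):
    """Lotus rescaled so that 1 / n_categories maps to 0."""
    chance = 1 / n_categories
    return (lotus(units) - chance) / (1 - chance)


YES, IRR = "Yes (explicitly)", "IRRELEVANT"
example_units = [
    [IRR, IRR, IRR, IRR, IRR, IRR],  # unit 4775
    [YES, YES, YES, IRR, YES, YES],  # unit 4567
    [IRR, YES, IRR, IRR, IRR, IRR],  # unit 1948
]

print(round(simple_agreement(example_units), 2))  # 0.33
print(round(holsti_cr(example_units), 2))         # 0.78
print(round(lotus(example_units), 2))             # 0.89
print(round(s_lotus(example_units, 2), 2))        # 0.78
```

With six coders, one dissenting coder is enough to count a unit as a disagreement under simple agreement, while pairwise and mode-based measures only lose a fraction. That is one plausible reason why Agreement sits well below Holsti's CR and Lotus throughout these slides.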

What understanding or sub-category of opinion is measured here?

Agreement (all coders)
name value
n_Units 50
n_Coders 17
n_Categories 12
Krippendorffs_Alpha 0.54
Agreement (full cases)
name value
n_Units 50
n_Coders 6
n_Categories 10
Agreement 0.44
Holstis_CR 0.66
Krippendorffs_Alpha 0.56
Fleiss_Kappa 0.55
Brennan_Predigers_Kappa 0.44
Lotus 0.78
S_Lotus 0.75

New sub-categories of opinion?

value n
sentiment 6
emotion 3
opinion 3
identity-construction 1
topic 1
Aspect-based sentiment analysis 1
Consumer requirements and preferences 1
Feedback 1
Positions 1
Public perception 1
actionable tweet 1
attitude = sentiment? 1
attitudes 1
explanations for sentiment and sentiment 1
extracting aspect-based sentiment score per text item 1
feedback 1
feeling 1
framing 1
ideas 1
ideologies 1
importance weight and sentiment score towards an object/subject 1
opinions 1
position 1
public perceptions of certain topics 1
sentiment + topic 1
sentiment = attitude? 1
sentiument 1
they seem to conflate opinion and topics (but they still call it opinion) 1
to detect the topic being discussed and classify users' sentiments towards those topics 1

Is this research that uses a tool?

Agreement (all coders)
name value
n_Units 50
n_Coders 17
n_Categories 5
Krippendorffs_Alpha 0.53
Agreement (full cases)
name value
n_Units 50
n_Coders 6
n_Categories 5
Agreement 0.34
Holstis_CR 0.66
Krippendorffs_Alpha 0.53
Fleiss_Kappa 0.53
Brennan_Predigers_Kappa 0.34
Lotus 0.79
S_Lotus 0.74

Disagreement Examples

unit_id  83                86                91                99                90                100
1017     Yes (implicitly)  Yes (explicitly)  IRRELEVANT        Yes (implicitly)  Yes (explicitly)  Yes (implicitly)
597      Yes (explicitly)  Yes (explicitly)  Yes (implicitly)  Yes (implicitly)  Yes (implicitly)  Yes (implicitly)
1948     IRRELEVANT        Yes (explicitly)  IRRELEVANT        IRRELEVANT        IRRELEVANT        IRRELEVANT

ID: 1948

Development and Evaluation of Speech Synthesis System Based on Deep Learning Models

This study concentrates on the investigation, development, and evaluation of Text-to-Speech Synthesis systems based on Deep Learning models for the Azerbaijani Language. We have selected and compared state-of-the-art models-Tacotron and Deep Convolutional Text-to-Speech (DC TTS) systems to achieve the most optimal model. Both systems were trained on the 24 h speech dataset of the Azerbaijani language collected and processed from the news website. To analyze the quality and intelligibility of the speech signals produced by two systems, 34 listeners participated in an online survey containing subjective evaluation tests. The results of the study indicated that according to the Mean Opinion Score, Tacotron demonstrated better results for the In-Vocabulary words; however, DC TTS indicated a higher performance of the Out-Of-Vocabulary words synthesis.

ID: 1017

NLP-Based Customer Loyalty Improvement Recommender System (CLIRS2)

Structured data on customer feedback is becoming more costly and timely to collect and organize. On the other hand, unstructured opinionated data, e.g., in the form of free-text comments, is proliferating and available on public websites, such as social media websites, blogs, forums, and websites that provide recommendations. This research proposes a novel method to develop a knowledge-based recommender system from unstructured (text) data. The method is based on applying an opinion mining algorithm, extracting aspect-based sentiment score per text item, and transforming text into a structured form. An action rule mining algorithm is applied to the data table constructed from sentiment mining. The proposed application of the method is the problem of improving customer satisfaction ratings. The results obtained from the dataset of customer comments related to the repair services were evaluated with accuracy and coverage. Further, the results were incorporated into the framework of a web-based user-friendly recommender system to advise the business on how to maximally increase their profits by introducing minimal sets of changes in their service. Experiments and evaluation results from comparing the structured data-based version of the system CLIRS (Customer Loyalty Improvement Recommender System) with the unstructured data-based version of the system (CLIRS2) are provided.

ID: 597

From Stances’ Imbalance to Their Hierarchical Representation and Detection

Stance detection has gained increasing interest from the research community due to its importance for fake news detection. The goal of stance detection is to categorize an overall position of a subject towards an object into one of the four classes: agree, disagree, discuss, and unrelated. One of the major problems faced by current machine learning models used for stance detection is caused by a severe class imbalance among these classes. Hence, most models fail to correctly classify instances that fall into minority classes. In this paper, we address this problem by proposing a hierarchical representation of these classes, which combines the agree, disagree, and discuss classes under a new related class. Further, we propose a two-layer neural network that learns from this hierarchical representation and controls the error propagation between the two layers using the Maximum Mean Discrepancy regularizer. Compared with conventional four-way classifiers, this model has two advantages: (1) the hierarchical architecture mitigates the class imbalance problem; (2) the regularization makes the model to better discern between the related and unrelated stances. An extensive experimentation demonstrates state-of-the-art accuracy performance of the proposed model for stance detection.

How would you categorize the measurement approach of the tool?

Agreement (all coders)
name value
n_Units 50
n_Coders 17
n_Categories 10
Krippendorffs_Alpha 0.41
Agreement (full cases)
name value
n_Units 50
n_Coders 6
n_Categories 10
Agreement 0.26
Holstis_CR 0.53
Krippendorffs_Alpha 0.39
Fleiss_Kappa 0.39
Brennan_Predigers_Kappa 0.26
Lotus 0.69
S_Lotus 0.66

New sub-categories of approaches?

value n
NiA 18
NA 3
LDA 2
N/A 2
NLP 2
nia 2
search engine 2
11 different classifiers 1
collocations) 1
dictionary 1
sentiment analysis (unspecified in abstract) 1
AEDA, LaBSE, BiLSTM 1
Classifier 1
Deep learning sentiment analysis based on a autoregressive language model 1
LDA + sentiment analysis (unspecified how) 1
Lexicon and supervised machine learning 1
Machine learning 1
Metaheuristics based long short term memory 1
Natural Language Tool 1
Not stated 1
Opinion Mining approach with unclear approach 1
Relational Network approach 1
SML, but unclear what type 1
Sentiment Analysis 1
Unclear 1
a new model for zero-shot stance detection 1
adaptive learning emotion identification method (ALEIM) based on mutual information feature weight 1
adversarial learning (might be DL or ML, the details are not in the abstract, and there is no link to the paper) 1
classic SML 1
crowd sourcing 1
dependency parsing 1
dictionary approach, unsupervised machine learning 1
lexicon-based method, bag-of-words module and semantic module 1
natural language explanation framework 1
natural language processing (NLP) 1
rule-based (dependency parsing) 1
rule-based (keywords-in-context 1
semi-supervised SML 1
shallow supervised ML 1
text mining 1
weakly supervised ML 1
weakly supervised learning paradigm 1
zero-shot stance detection on Twitter that uses adversarial learning 1
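
Several entries in the list above differ only in capitalisation (for example "NiA" and "nia"). A minimal sketch of tallying free-text answers after folding case and collapsing whitespace could look like the following. The helper name `tally_free_text` is hypothetical, and the sketch deliberately does not merge entries such as "NA" and "N/A", since the slides do not say whether they mean the same thing.

```python
from collections import Counter


def tally_free_text(entries):
    """Count free-text answers after folding case and collapsing whitespace."""
    normalized = (" ".join(entry.split()).casefold() for entry in entries)
    return Counter(normalized)


# Counts taken from the top rows of the table above.
entries = ["NiA"] * 18 + ["nia"] * 2 + ["NA"] * 3 + ["N/A"] * 2
counts = tally_free_text(entries)
print(counts.most_common())
```

Only spelling variants are merged here; grouping substantively similar answers (e.g., the various supervised machine learning descriptions) would still need a manual pass.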

What software application or model or dictionary is employed by the research to measure the concept?

Agreement (all coders)
name value
n_Units 50
n_Coders 16
n_Categories 5
Krippendorffs_Alpha 0.51
Agreement (full cases)
name value
n_Units 50
n_Coders 6
n_Categories 5
Agreement 0.54
Holstis_CR 0.8
Krippendorffs_Alpha 0.52
Fleiss_Kappa 0.52
Brennan_Predigers_Kappa 0.54
Lotus 0.88
S_Lotus 0.85

Does the abstract mention or hint which data(set) was analysed?

Agreement (all coders)
name value
n_Units 50
n_Coders 15
n_Categories 3
Krippendorffs_Alpha 0.56
Agreement (full cases)
name value
n_Units 50
n_Coders 6
n_Categories 3
Agreement 0.42
Holstis_CR 0.71
Krippendorffs_Alpha 0.54
Fleiss_Kappa 0.54
Brennan_Predigers_Kappa 0.42
Lotus 0.82
S_Lotus 0.73

What kind of dataset is referenced?

Agreement (all coders)
name value
n_Units 50
n_Coders 15
n_Categories 5
Krippendorffs_Alpha 0.37
Agreement (full cases)
name value
n_Units 50
n_Coders 6
n_Categories 5
Agreement 0.32
Holstis_CR 0.63
Krippendorffs_Alpha 0.38
Fleiss_Kappa 0.38
Brennan_Predigers_Kappa 0.32
Lotus 0.76
S_Lotus 0.7

Does the abstract mention or hint at the natural language analysed?

Agreement (all coders)
name value
n_Units 50
n_Coders 15
n_Categories 3
Krippendorffs_Alpha 0.42
Agreement (full cases)
name value
n_Units 50
n_Coders 6
n_Categories 3
Agreement 0.48
Holstis_CR 0.74
Krippendorffs_Alpha 0.45
Fleiss_Kappa 0.45
Brennan_Predigers_Kappa 0.48
Lotus 0.82
S_Lotus 0.73

Does the abstract mention or hint at the country analysed?

Agreement (all coders)
name value
n_Units 50
n_Coders 15
n_Categories 3
Krippendorffs_Alpha 0.37
Agreement (full cases)
name value
n_Units 50
n_Coders 6
n_Categories 3
Agreement 0.48
Holstis_CR 0.72
Krippendorffs_Alpha 0.4
Fleiss_Kappa 0.4
Brennan_Predigers_Kappa 0.48
Lotus 0.81
S_Lotus 0.71

Results

Valid

Tentative

Results are based on a simple majority vote.

(Note: the wording of the question was wrong)
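
The majority vote mentioned above can be sketched as follows. This reads "simple majority" as "the most frequent code wins" and returns no result on a tie; whether the actual analysis handled ties or winners below 50% differently is not stated in these slides, so both choices are assumptions, as is the function name `majority_vote`. The example rows are taken from the Disagreement Examples slide.

```python
from collections import Counter


def majority_vote(codes):
    """Return the most frequent code, or None if there is no unique winner."""
    ranked = Counter(codes).most_common()
    if not ranked:
        return None
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between the two most frequent codes
    return ranked[0][0]


IMPL, EXPL, IRR = "Yes (implicitly)", "Yes (explicitly)", "IRRELEVANT"
annotations = {
    1017: [IMPL, EXPL, IRR, IMPL, EXPL, IMPL],
    597: [EXPL, EXPL, IMPL, IMPL, IMPL, IMPL],
    1948: [IRR, EXPL, IRR, IRR, IRR, IRR],
}

results = {unit_id: majority_vote(codes) for unit_id, codes in annotations.items()}
print(results)
```

Under this reading, unit 1017 resolves to "Yes (implicitly)" with only three of six votes, which is the kind of case where a stricter threshold would leave the unit undecided.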